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ABSTRACT 

We  discuss  and  analyze  issues  related  to  the  design  of  pseudorandom  number  gen- 
erators (prn's)  for  MIMD  (multiple  instruction  stream/multiple  data  stream)  paral- 
lel processors,  which  are  very  well  suited  for  Monte  Carlo  calculations.  We  are 
concerned  to  ensure  reproducibility  of  runs,  to  provide  very  long  sequences,  and  to 
assure  an  adequate  degree  of  independence  of  the  parallel  streams.  We  consider 
the  class  of  linear  congruential  generators 

*n+l,i  =  <*Xn,i+bi   mod  m 

and  analyze  the  effect  that  different  choices  of  b(  have  on  the  correlation  properties 
between  such  streams.  We  derive  a  spectral  test  v,  for  t  parallel  linear  congruential 
generators,  a  modification  of  Knuth's  Algorithm  S.  From  this,  we  prove  a  good 
lower  bound  for  v2=       min        v2(/j)  for  certain  choices  of  bi's.    The  set  of  the 

all  pairs(ij)  . 

largest  r  primes  pit  t—  l,...,r  satisfying  p-x  <  ^m/2,  where  m  is  the  period  length  of 
each  generator  gives  a  lower  bound  0(w%)  to  the  correlation  between  a  pair  of 
corresponding  elements  in  any  two  streams.  An  alternative  choice,  bt  =  dl  mod  m 
for  d  =  mH+ 1  gives  a  bound  0(jn  /(*—  1))  which  will  be  satisfactory  for  small 
numbers  of  streams.  Finally,  we  construct  a  spectral  test  for  correlations  between 
xni  and  xn+k<i  +  i,  but  derive  no  analytic  prescriptions  from  it. 

1.   Introduction 

Designing  pseudorandom  number  generators  for  current  and  future  parallel  processors 
presents  interesting  new  challenges.  We  are  interested  in  generators  for  the  class  of 
machines  called  MIMD  (multiple  instruction  stream/multiple  data  stream)  in  which  a 
number  of  processors  are  capable  of  essentially  independent  computations.  Such  architec- 
tures are  very  well  suited  for  Monte  Carlo  calculations,  since  potentially  independent  reali- 
zations or  "histories"  may  be  followed  on  each  processor.  They  are  much  more  suitable 
than  the  current  generation  of  "vector"  supercomputers  for  those  Monte  Carlo  calculations 
with  substantial  logical  complexity,  i.e.,  those  with  data-dependent  or  random  number 
determined  branches.  One  example  of  an  MIMD  architecture  is  the  NYU  Ultracomputer; 
this  is  a  projected  architecture  that  will  be  scaleable  up  to  thousands  of  processors  with 
features  that  make  it  attractive  for  Monte  Carlo  calculations  of  all  kinds.  The  issues  dis- 
cussed in  this  paper  are  applicable  to  other  parallel  machines  as  well,  including  existing 
vector-parallel  machines  (such  as  the  various  Crays,  multi-pipe  Cyber  205s,  and  the  pro- 
jected ETA  10,  the  IBM  3094-400,  and  various  hypercube  machines  [1]).  Nevertheless,  the 
emphasis  will  be  on  problems  associated  with  very  large  numbers  of  processors. 


The  new  problems  posed  by  parallel  computing  are  the  following:  for  asynchronous 
and  multiprogrammed  machines  special  care  is  needed  to  ensure  reproducibility  of  runs; 
very  long  sequences  may  be  necessary;  finally,  attention  to  the  independence  of  the  paral- 
lel streams  used  on  different  processors  is  required. 

In  what  follows  we  distinguish  between  the  computational  "processes"  and  the  physi- 
cal computers  of  which  the  ensemble  is  made.  This  is  necessary  because  in  the  class  of 
architectures  that  we  consider,  there  is  no  predictable  mapping  of  processes  to  processors. 

To  simplify  somewhat,  two  approaches  can  be  considered  for  the  generation  of  pseu- 
dorandom sequences:  either  one  (or  a  few)  processes  are  dedicated  to  generating  the 
numbers  and  other  processes  consume  their  output,  or  else  each  process  generates  its  own. 
The  first  is  less  desirable  from  several  points  of  view.  It  may  be  difficult  to  balance  the 
speed  of  consumption  and  production;  it  may  be  difficult  to  communicate  the  sequences 
from  producers  to  consumers.  Above  all,  if  there  is  any  degree  of  asynchronism,  the  pat- 
tern of  assignment  of  random  numbers  to  processes  may  not  be  reproducible  from  one  run 
to  another. 

It  has  been  argued  that  reproducibility  is  not  necessary  for  Monte  Carlo  calculations. 
After  all,  if  the  results  are  not  predictible,  what  is  bad  about  the  trivial  additional  feature 
of  non-reproducibility  from  run  to  identical  run?  Put  another  way,  if  the  results  are  mean- 
ingful only  within  some  statistical  error  bound,  why  be  concerned  whether  a  run  can  be 
exactly  repeated?  One  answer,  at  least  for  that  large  class  of  problems  that  have  the  pro- 
perty of  logical  complexity  mentioned  above,  is  that  the  potential  for  exact  reproducibility 
is  extremely  helpful  (probably  essential)  for  debugging.  The  logical  errors  that  often  occur 
in  such  programs  are  difficult  to  isolate  if  the  run  cannot  be  exactly  retraced.  An  addi- 
tional advantage  of  exact  reproducibility  is  that  it  enhances  confidence  in  the  validity  of  a 
ported  program.  It  is  possible  to  program  these  algorithms  in  a  style  that  produces  the 
same  results  on  machines  of  different  sizes,  even  on  machines  in  which  the  number  of  pro- 
cessors available  to  a  specific  calculation  varies  unpredictably  during  the  course  of  the  run. 
It  is  necessary,  of  course,  that  the  initial  state  of  the  computation  (including  the  parameters 
of  the  generators)  can  be  set  to  be  the  same  in  alternative  runs.  This  requirement  of  repro- 
ducibility puts  an  additional  restriction  on  the  class  of  random  number  generators  that  may 
be  used.   We  assert  that  it  is  a  desirable  feature  and  we  shall  see  that  it  can  be  satisfied. 
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Another  issue  arises  from  the  consideration  of  Monte  Carlo  calculations  that  simulate 
branching  processes  or  in  which  branching  processes  are  used  as  a  technical  device. 
Natural  stochastic  processes  include  the  creation  and  destruction  of  particles  and  radiation 
(in  cascades  of  high-energy  particles,  and  in  nuclear  fission,  to  cite  two  significant  exam- 
ples.) The  straightforward  simulation  of  such  phenomena  leads  to  programs  in  which  the 
one  random  walker  may  become  two  or  more,  each  of  which  must  be  followed,  essentially 
independently.  In  addition,  effective  Monte  Carlo  treatment  of  some  problems  leads  natur- 
ally to  the  use  of  branching  processes  (e.g.  in  Green's  function  Monte  Carlo  [2]).  Impor- 
tance sampling  of  low  probability  events  is  often  accomplished  in  part  by  the  use  of  branch- 
ing. For  example,  in  the  calculation  of  the  passage  of  radiation  through  thick  media,  the 
techniques  called  "splitting"  and  "Russian  Roulette",  colorful  names  for  the  birth  and 
death  of  random  walkers,  are  used  to  enhance  the  chance  that  some  walker  will  survive 
through  the  medium  while  at  the  same  time  assuring  an  unbiassed  estimate  of  any  property 
of  the  emerging  radiation  [3].  Finally,  we  note  that  Fredricksen  et  al.  [4]  cite  several 
examples  where  calculations  are  more  reproducible  or  where  perturbations  can  be  calcu- 
lated more  precisely  when  branches  are  inserted  in  the  "natural"  flow  of  the  simulation. 

On  parallel  computers,  an  important  consideration  that  makes  for  efficient  utilization 
of  all  processors  is  "load  balancing."  That  is  the  body  of  techniques  that  guarantee  that  all 
processors  carry  out  approximately  equal  work.  In  calculations  in  which  branching 
processes  appear,  load  balancing  is  likely  to  require  that  multiple  random  walkers  that  des- 
cend from  some  parent  be  simulated  on  processors  other  than  the  one  that  simulated  the 
parent.  Then,  reproducibility  of  the  calculation  cannot  be  assured  if  the  random  number 
generators  are  associated  with  the  physical  processors.  This  is  also  true  when  the  numbers 
of  processors  varies  out  of  the  program's  control  and  the  simulations  migrate  from  proces- 
sor to  processor. 

The  necessary  alternative  is  that  random  number  generation  be  associated  with  the 
simulation  processes  (in  the  jargon  of  parallel  computing,  with  the  computational  processes 
or  "threads  of  control"  ),  that  they  migrate  with  the  simulation  to  other  processors,  and 
that  new  generators  be  created  as  often  as  necessary.  In  particular,  for  branching  walks,  a 
new  sequence  should  be  created  for  each  descendent  random  walker  that  may  migrate  to 
other  processors.  This  was  first  pointed  out  by  Zhong  and  Kalos  [5]  and  leads  to  the  con- 
cept of  a  pseudorandom  tree,  in  which  one  generator  is  used  for  "intra-process"  random 
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numbers,  another  for  initialization  of  these  new  streams.  We  suggest  that  is  it  better  to 
think  of  the  whole  calculation  to  be  mapped  onto  a  tree,  each  node  of  which  is  associated 
with  the  initiation  of  a  new  stream.  The  streams  are  not  necessarily  a  part  of  the  tree.  A 
"conventional"  parallel  Monte  Carlo  in  which  one  stream  is  associated  with  each  of  several 
or  many  processes,  but  in  which  no  new  streams  are  created  later,  will  use  only  the  left 
successors  of  the  root. 

Computers  of  these  kinds  will  be  used  for  very  large  calculations  that  will  require  very 
large  aggregate  numbers  of  random  variables.  A  standard  pseudorandom  sequence  may  be 
adequate  for  each  process.  It  will  be  best  if  each  sequence  can  be  guaranteed  to  have  no 
subsequence  in  common  with  any  other  (or  at  least  that  overlapping  sequences  should  be 
very  rare.)  Thus  it  will  be  impractical  simply  to  use  subsequences  of  consecutive  elements 
of  a  single  congruential  generator  unless  its  period  is  very  long  indeed. 

The  sequences  will  also  have  to  be  independent  enough  so  that  the  parallel  replications 
of  the  calculation  reduce  the  Monte  Carlo  variance  by  the  number  of  such  replicas.  If,  for 
example,  one  process  used  a  long  sequence  in  common  with  another  in  exactly  the  same 
way,  the  additional  computation  (producing  exactly  the  same  results)  would  not  reduce  the 
variance.  The  "efficiency"  of  the  parallel  computation  would  be  reduced.  A  more  likely, 
although  more  subtle,  problem  arises  if  there  is  some  statistical  correlation  among 
members  of  different  streams.  We  must  consider  the  issue  of  providing  thousands  (possi- 
bly many  more  in  the  case  of  branching)  of  sequences  that  are  independent  in  this  sense. 

Fredrickson  et.  al.  [4]  have  proposed  a  specific  way  to  construct  pseudorandom  trees, 
which  results  in  what  they  call  a  "Lehmer  tree".  Congruential  generators  are  used  for 
both  branches.  Starting  with  x0,  the  root  of  the  tree,  every  element  has  a  left  branch  and 
right  branch  that  are  calculated  as  follows. 

xL  =  aLx  +  bL  mod  m 
xR  =  aRx  +  bR  mod  m 

The  constants  aL,  aR,  bL,  bR  and  m  determine  the  tree.  The  right  successors  are  used 
within  a  process.  New  parallel  sequences  are  seeded  from  the  left  successor.  Thus,  the 
constants  aL,  bL  must  be  chosen  to  satisfy  the  criteria  for  a  good  multiplicative  congruential 
generator.    However,  because  all  the  right  sequences  use  the  same  parameters  except  the 
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seed,  they  are  subsequences  of  a  single  generator.  The  authors  proved  a  necessary  condi- 
tion for  non-overlapping  of  these  subsequences.  But  the  total  number  of  entries  is  at  most 
the  original  period.  Unless  a  very  long  period  is  used  (which  will  require  multiprecision 
arithmetic  on  current  machines)  the  sequences  will  not  be  long  enough  when  very  many 
branches  of  the  tree  are  encountered.   This  is  the  case  with  which  we  are  concerned  here. 

Alternative  methods  are  known  that  give  very  long  sequences.  Tausworthe  generators 
[6,7]  suggest  themselves.  Unfortunately,  they  are  difficult  to  initialize  so  that  it  is  not  clear 
how  to  provide  very  many  independent  sequences.  Furthermore,  in  preparing  for  the 
migration  of  processes,  the  entire  list  that  specifies  the  state  of  the  generator  must  be 
copied. 

Composite  generators  proposed  by  Marsaglia  [8]  will  also  yield  long  sequences.  The 
state  is  normally  shorter  than  that  in  a  typical  Tausworthe  generator,  but  again  the  theory 
for  obtaining  independent  sequences  is  not  available. 

This  paper  explores  a  simple  proposal  for  the  generation  of  pseudorandom  trees, 
namely  that  a  new  sequence  to  be  used  within  a  process  differ  from  its  parallel  siblings 
simply  in  using  a  different  additive  constant.  Our  investigations  began  with  the  observation 
that  changing  the  constant  guarantees  that  the  original  sequence  be  completely  reordered, 
so  that  overlapping  sequences  cannot  occur.  This  is  a  very  simple  algorithm,  which  satis- 
fies all  the  practical  requirements  that  we  have  set,  providing  that  some  measure  of  the 
independence  can  be  assured. 

We  explore  the  issue  of  independence,  propose  several  pragmatic  solutions.  Prelim- 
inary testing  (to  be  described  elsewhere)  supports  the  claim  that  they  are  satisfactorily 
independent  for  many  purposes. 

In  section  2,  we  derive  the  spectral  test  v,  for  t  parallel  linear  congruential  generators. 
vf  is  a  measure  of  the  largest  wave  numbers  or  smallest  scale  of  resolution,  at  which  the 
points  generated  in  the  t-dimensional  space  can  be  regarded  as  uniform.  Thus  large  vf  indi- 
cates a  high  degree  of  randomness.  The  t  linear  congruential  generators  use  the  same  mul- 
tiplicative constant  and  have  t  different  additive  constants  b{,  i=\,...,t.  The  new  spectral 
test  turns  out  to  be  a  modification  of  Knuth's  [9]  Algorithm  S.  Section  3  deals  with  the 
variance  reduction  problem  in  its  simplest  form,  namely  for  a  function  of  one  variable,  to 
demonstrate  the  need  to  choose  the  bt's  (/=l,...,r),  such  that  all  (£}  Pairs  °f  a  t-component 
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pseudorandom  vector  satisfy  the  spectral  test.  In  section  4,  we  prove  our  basic  theorem, 
which  provides  a  good  lower  bound  for  v2=       min        v2(/j)  for  certain  choices  of  bi's 

all  pairs(ij) 

and  we  then  proceed  by  suggesting  concrete  methods  for  choosing  the  t  different  b^'s.  One 
is  that  of  choosing  the  set  of  the  largest  t  primes  ph  i=\,...,t  satisfying/?,  <  ^m/2,  where 
m  is  the  period  length  of  each  generator.  Section  5  gives  bounds  for  multiplets,  for  bt's 
satisfying  certain  conditions,  whereas  section  6  deals  with  a  very  simple  specific  choice, 
bj  =  d'  mod  m  for  d  =  vm  +  i.  This  power  method,  though  very  convenient  computation- 
ally, becomes  inferior  when  t  is  large,  as  seen  from  the  comparison  we  have  presented. 
Further  related  results  are  given  in  section  7. 
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2.  The  Spectral  Test  for  Parallel  Generators 

We  here  derive  a  modification  of  Algorithm  S  (see  [10])  which  tests  the  resolution  at 
which  vectors  of  prn's  coming  from  similar  chains,  modified  only  by  different  additive  con- 
stants, can  be  regarded  as  uniformly  distributed.  We  assume  throughout  that  we  already 
have  a  "good  multiplier"  a,  so  that  each  sequence,  when  considered  by  itself,  passes  the 
spectral  tests.    This  assumption  will,  however,  not  play  any  direct  role  in  our  discussion. 

Consider 

xn+lti  =  axn>i+bt   mod  m  (2.1) 

i  =  l,...,f  ,    bj  odd,  m  =  2^,  a    a  good  multiplier  and  assume  that  each  sequence  has  max- 
imum period  (i.e.   a  mod  4  =  1)  so  that  all  values  0,...,  m-1  occur. 

Then 

n    I 

xn,i  -  *o,i  +  7« r~   rn°d  m  (2.2) 

a—  1 

where 

yt  ■  bt  +  (a  -  l)x0>i      i=l t.  (2.3) 

Since  -y,-  is  relatively  prime  to  m,  "y,_1  is  a  unique  integer  mod  m  and,  therefore,  we  can 


write 


x„j  m  7i  l  Vj  xH,l+(xoj  ~  7i  1  1j  *o,i)  mod  m-  (2-4) 


For  any  integers  s\,  .  .  .  ,  s,  ,  then 


-l  -l 

S    Sj  xn,j  ~   71      xn,l    2    Sj'7j   +    £    */(*0J   _   71     7^0,l)     mod  m 
j=\     '  7=1     '  Jm\ 

_,  ' 

m  7i   xn>1  ^Sjyj  +  C      mod  m  (2.5) 

;  =  i 

t  _ 

where  C  =  2  sj(xoj  ~  7l    7;,;c0,i)  mod  m  's  a  constant  with  respect  to  rc. 

Consider  the  (  dimensional  discrete  vector  space  V  where  each  coordinate  axis  is  of 
length    m,    i.e.    V—{X},  X    =  (xl,  .  .  .  ,xt),    and     xt    can    assume    any    of    the    m    values 

0,  1 m  —  1.    One  can  define  a  probability  measure  g(x")  for  all  f  in  V.   If  the  vectors  are 

uniformly  distributed,  then 
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*(*)  =  — • 


m 


Now,  for  any  integers  (slt...,st)  mod  m,  define 

f(slt...,st)  =  E 


therefore,  here 


r  ~  i»i 


j-i 


/(.!,. ...5,)    =    -^      X    *        W;"1'/ 

;-i i 


2iri     ' 

1         m-1        ~    -     iW 

=  3    2    « 


7-1 


m 


2-rri 

1    A  "u1   — =-v 


m' y-i  ,f0 


-  n  80 .  raod  m 

7=1  ' 


For  the  sequence  (2.1),  we  have  instead: 


2-ni  J, 


where 


•  ,xt) 


g(xi xt)  = 


(  1 


m 


if  f  €{;?„  «  =  1,  .  .  .  ,m) 


and*„  =  Ufl|1, 
Thus, 


0        otherwise 
,*„,),  xni  defined  in  (2.2). 
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2iri    ^ 
1      «       "  m    ,_,V"J 

/Ul *r)   =   —    2*  ;l        • 

n  =  \ 


or  using  (2.5) 


2  vie 


f(Sl 


.*»)  - «    m  —  s  « 


2tti         .         „ 

1    «    -— ^VSv, 


7-1 


m 


n  =  l 


But,  the  set  {xnl}^=1  is  equal  to  the  set  of  integers  {0,  .  .  .  ,m  —  \).    Hence 


2-nic 


2-ni 


f(sh  .  .  .  ,s,)  =  e 


—  2-nic 


=  e 


m      2    e  J~l 

x  =  0 


o.^r1  s^Tj mod  m 

;-i 


or 


The  modified  spectral  test  will  then  be 


1         if  "Yi     2f/T/  =  Omod  m 
0        otherwise 


min 
(*, O 

»(*, j,)*(o 0) 


{^ 


2+...+5,2 


^l+7i  172-s2+-"+'Yl  1'itst  =  0  m°d  m 


) 


(2.6) 


2irvr  is  the  magnitude  of  the  largest  wave  vector  that  is  absent  in  a  Fourier  decomposi- 
tion of  the  point  distribution  on  the  t-dimensional  unit  hypercube.  The  distribution  is 
therefore  uniform  at  a  resolution  of  l/v,  compared  to  l/m  for  the  ideal  distribution  g(X). 

(2.6)  can  be  computed  using  Knuth's  Algorithm  S  (see  [10]  p.  98,  cf.  3.3.4),  since  his 
algorithm  does  not  use  the  fact  that  the  coefficients  are  powers  of  a  . 

For  a  vector  of  t  components  and  m  =  2p  (e.g.,  0  =  48)  the  highest  accuracy  with  which 
uniformity  can  be  achieved  is  p/r  (48/r)  bits.  (If  all  248  distinct  points  are  distributed  over 
the  t-dimensional  unit  hypercube,  with  spacing  2_/,  then  this  implies  248  =  (21)'  or 
/  =  48/r) . 
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We  will  show  later  that  not  only  can  this  uniformity  be  achieved  for  a  2  or  3  com- 
ponent pseudorandom  vector  but  it  can  be  achieved  simultaneously  for  all  L)  pairs  or  L  J 
triplets,  respectively,  chosen  from  a  t  component  pseudorandom  vector. 
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3.   The  Spectral  Test  for  Minimum  Variance 

In  a  frequent  type  of  computation,  the  problem  is  how  to  choose  the  bt's  or  ("y,'s) 
/  =  1,  .  .  .  ,t  such  that  all    [2)  pairs  (xni,  xnj)  are  uniformly  distributed  at  high  levels  of 
accuracy.   This  problem  arises  in  the  following  fashion: 
Given 

*l,lt  *2,1.  ■  ■  ■  -xn,\>  ■  ■  •  -*m,l  obtained  by  using  bx 


xi,t'x2,t>  ■  ■  ■  >xn,t>  ■  ■  ■  <xmj  obtained  by  using  b, 
Let  /(j)  be  a  random  variable  assuming  the  values  for  /(x^i),  .  .  .  ./(*m,i) 
/(2)   be  a  random  variable  assuming  the  values  for  f(xi^)>  ■  ■  ■  >f(xm,l) 


/(,)  be  a  random  variable  assuming  the  values  for /(;t1)f),  .  .  .  ,f(xmt). 
Then 


War 


2  /(/)}  =  2  Varfu)  +  2  ^v(f{i),fU)) 

U=l         )         i=l  ij 

l* J 


In  order  for  the  generators  to  behave  independently,  with  respect  to  this  computation,  we 
require 

cov(/(0'/c/))  =  °    for  a11  Pairs  ^'^-  (3-^ 

We  will  show  now  that  (3.1)  is  reduced  to  a  condition  on  the  function  /  that  depends  on  the 
values  of  v2(zj)  for  all  [2)  pairs  for  (/,;')  . 

We  can  represent /(x)  as: 

fix)  =    2  //  e  m 
1=0 
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since  f(x)  is  a  periodic  function  of  period  m.    For  simplicity,  let  us  consider  the  pair  (1,2) 
and  compute  covif^J^)) 

COv(f(1)j{2))   =  -£■  2  f(Xj,l)f(Xj,2)   -^2  /(*;,l)/(*;,2) 
)  m       J 

but,  since  .*,•  2  =  TfS^.l  +  c  (  see  (2.4)  ). 

„       M-i^f  .^  +  ""^  +  l'c)  -±T//  «^  +  ^ 
cov(/(1),/(2))  -  —  2  ////■«  w2  2.  J i It  e 


},l,t  "'        M 


2tii7'c 
1 


=    —  m1flffe  80,(/  +  /7rLy2)  mod  m         /(^O 


w      i.r 


or 


2in  ,, 
/  c 


COV(/(1),/(2))    =  X        ////'«  80,(/  +  /'7rS2)modm  (3.2) 

1,1 

3(i,i>(o,o) 


Define 


v2(l,2)  =      min     |V/2  +  /'2  |  /  +  7l  i^2/'  =  o  mod  w  J  (3.3) 

9(/,l>(0,0) 

Consider  any  term  on  the  r.h.s.  of  (3.2)  and  choose  /  -  max(|/|,|/  |). 

If  l<^vp~ then  i+i'yi1^*0'  hence<  8o,(/+/wrs2)modm  =  °-  But  if  l-^7i~ then 

8    ,     ,  _,   ,  is  not  necessarily  zero  and  we  therefore  need  to  have  /,  =  0  to  satisfy 

0,('  +  '*yi  hi)  m0°  m 

covif (i)J(2))  =  0  in  (3.2). 
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We  conclude,  therefore,  that 


^2  -2=Lfc 

fix)  -      2     fi*     m  (3.4) 


v2 

will  satisfy  (3.1):    only  functions  with  frequency  components  <  —f=  are  guaranteed  to  give 


2  cov(f^,f(j))  =  0  .   The  v2  in  (3.4)  is  now 


v2  =    _  min        v2(ij)  n  ~ 

all  pairs  fij)  1-5 -J ) 
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4.   Basic  Theorem 

The  above  demonstrates  the  need  to  choose  the  t  bt's  (or  7/5)  such  that  all  flj 
pairs  satisfy  the  spectral  test,  i.e.  such  that  all  pairs  (x„ti,x„j)  i<j  are  uniformly  distri- 
buted at  the  highest  level  of  accuracy  achievable.  This  will  be  accomplished  if  v2  is  of  the 
order  m%. 

To  accomplish  this  let  us  first  consider  the  case  that  xi0  =  0  for  all  i  =  l,...,f.  Then 
from  (2.3)  -y,  =  bt.   To  choose  the  b/s  we  rely  on  the  following: 

Theorem  1 

Let  bi<m,   bj<m  and  g.c.d.(bhbj)  =  1 
Then 


and 


v|(/J)  =  bf  +  bj  if  bf  +  bf  <  m 


V2('V)  2=  7TTT7  if  b?  +  b?^m 

bt   +  bi 


Proof 


vfO'J)  =      nrin      Is}  +  s\  \  s-^b,  +  s2bj  =  0  mod  m\ 
a(vO     I  ) 

slb[  +  s^bi  =  0  mod  m  is  equivalent  to 

•Sl^i  +  sibj  =  km  k  —  integer  (4.1) 

Since  (bitbj)  =  1,  (4.1)  is  satisfied  for  k  =  0  by  |  Jj  |  =  a  |  bj  \,  \  s2  |  =  a  |  bf  |  for  a  an 

integer  whence 

min  Vsj  +  s\  =  VbfTbf 

b(svs2) 
(j^2)  *  (0,0)  &  k-0 


If 


k  #  0        Vbf^~b~J  VyfTTJ  >    5^,  +  ^2/7; 


>:  /n     or 
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sl  + 

4 

> 

m2 

bf 

^j 

C( 

orollary 

1 

If 

bt 

< 

Vm/2, 

h 

< 

Vm/2 

Then 

v|(i 

\j) 

= 

bf 

+  4 

Q.E.D. 


g.c.d.  (bi,bj)  =  1 


Corollary  2 

Let        b{  =  aVJJi,     /?,  =  p^m ?  g.c.rf.    (fc.,fy)  =  1    and    a2  +  (S2  <  1 

Then  v22(i,j)  =  (a2  +  (J2)  m 

Notice  that  Theorem  1  can  be  easily  generalized  to  the  case  g.c.d.  (bhbj)  =  c  in 

which  case,  we  obtain 

vid  J)  =  -\(b2  +  bj)        if  b2  +  bf  <  mc2 
c 

V2<f^  ~    ull  u2        if   b?  +  b2^rn 
bi   +  bj 

For  the   case  x0,,  =  0    i=\,  .  .  .  ,t  we  can  now  take  advantage   of  Theorem   1  by 
choosing  the  t   bt' s  such  that 

bi  <  ^/^/2 

and  g.c.d  (bhbj)  =  1  for  all  pairs  (i,j)    1  <  z  <  ;'  <  f.    This  can  be  accomplished  in  several 
ways  e.g.: 

Let  pi  be  the  largest  prime  such  that/?!  <  'V mil 
p2  be  the  largest  prime  such  thatp2  <  Pi 


p,  be  the  largest  prime  such  that  p,  <  pt-\ 
Choose  bi  =  pj  i  =  1,  .  .  .  ,t 
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Then 


vl(ij)  =  bf  +  bj 


and 

V2  =  y/p?  +  P?-i 


Another  method  is  that  of  mixing  primes  of  the  order  of  "^  mil  with  powers  of  small 
primes  e.g.  3  \  5  2,  7  \  .  .  .  ,  etc.  such  that  3  ',  5  ',  7  J,  ,  .  .  ,  are  of  the  order  of  vm/2, 

Let  us  note  that  the  restriction  *,-  0  =  0  i  =  1,  .  .  .  ,t  is  unnecessary.  If  *,o  ¥=  0  for 
some  i,  then  using  (2.3)  we  get  -y,-  =  (a  — I)  x0i  +  £>,•  =  rm  +  v,-,  r  integer  and  we  will 
choose  bj  to  guarantee  that  v,  <  "^ mil     and     g.c.d.  (v,,v,)  =  1  for  all  /#;'  . 
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5.   Bounds  for  Multiplies 

The  problem  of  creating  a  uniform  distribution  for  a  single  2-vector  can  be  generalized 
in  another  direction.   We  will  devise  a  method  of  choosing  three  bi's  such  that 

v3*  =  0(mv3); 

of  course,  as  explained  in  [10]  (pages  90,  91)  having  a  good  lower  bound  for  v3  does  not 
imply  a  good  lower  bound  on  v2  .  Taking  this  warning  into  account,  we  now  present  the 
following  result: 

Theorem  2 

Let 

xQti  =  0(1)   for  all  i  (5.1) 

Define  two  finite  sets  of  primes^ 

Set  I  =  ■  pi  :  p i  prime,  pl  =  0(m2/3),  and  pt  <  m2'3 

Set  II  =  \ps  :  ps  prime,  ps  =  0(mv3),  and  ps  <  m1/3[ 


Then 


v2(l,2)    ^    0(m1/3)  (5.2) 


whenever  ^i,/?2  are  either  members  of  the  same  set  or  of  different  sets. 

v3(l,2,3)    >    0(mV3)  (5.3) 

whenever  b^  e   set  I  ,  b-i,  Z?3   e   set  II. 


(1)  Note  that  this  assumption  is  not  necessary  and  as  before  it  is  only  used  to  simplify  the  statement  of  the 
Theorem  and  the  proof. 

(2)  "prime"  can  be  replaced  by  every  pair  of  elements  that  belong  to  I^J  II  that  have  no  common  divisor. 
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Corollary 


v2  =  0(m1/3) 


v3  =  0(m1/3) 

whenever  one  of  the  b's   €     set  I  and  the  other  two  b's   e     set  II. 
Proof  of  Theorem  2 

To  prove  (5.2)  let  bt  =  a,7«1/3     a,  <  1      i  =  1,2. 
Then  by  Theorem  1 


v2(l,2)  =  m1/3Va?  +  <x\ 


Let 


fej  =  axmm,     b2  =  a3m2/3  a,-  <  1      i  -  1,3 


Then        v|(l,2)  &      2    2/3 


afro""  +  a3m4/3   "  '    af 


.1/3 


■IB 


2/3 


or         v2(l,2)  S:  dmJ 

Let         bx  =  a4m2/3,  fo2  =  a3m2/3  a,-  <  1      i  =  3,4 

Then 


m 


V2(U)-    (a2  +  a4V4/3         a32  +  a2 

which  proves  (5.2). 

To  prove  (5.3) 

Let  bt  =?>im2J\     b2=  02m1/3,      b3  =  (J3m1/3 

where  (3,-  are  constants  of  order  1. 


m 


2/3 


v3(l,2,3)  =         min         |V ' s\  +  s\  +  si 


s\bx  +  s2b2  +  s3b3  =  0  mod  m 


Since  s^bx  +  s1b1  +  s3b3  =  0  mod  m,     let     s1bl  +  s2b2  +  s3b3  =  km 
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Case  1:     k  =  0;  then 


s2b2  +  S3b3  ~  ~s\bi 


If         sJ*  0      then 


si  +  si  +  si 


S2b2  +   ^3^3 

1/2 


^  (^i  +  ^32)y2  * 


^  &i         and  we  get 
bi 


Pjm 


2/3 


or 


v3(l,2,3)  >  cam1/3        where  cx  = 


Vbj  +  b\        mmV&  +  P32 

Pi 


V*22  +  p32 


If  5-j  =  0  then  using  Theorem  1,  v3(  1,2,3)  =  Vp|  +  (32 
Case  2:      it  #  0;  then 


m1/3. 


(52  +  s\  +  s%)*  (b\  +  b\  +  bl)%  >  |  Sib-i  +  s2b2  +  s3b3  |  >  m 


or 


v3( 1,2,3)  > 


m 


>  ^m^3 


Vp2/n4/3  +  P22^2/3  +  P^3   '      Pl 

This  completes  the  proof  of  the  Theorem. 

Theorem  2  can  be  generalized  to  higher  spectral  tests  in  the  following  way: 
Let 


.1/4 


.1/4 


1/2 


bi  =   Pimx/\  Z>2  =   P2™      .  &3  =   03™       .  fo4  "   04™ 


.3/4 


where  g.c.d.(bi,bj)  =  1      for  all  i  #  ;'  and   p,  /  =  1, 

0,  <  1 

Then 

v2(/j)  >  0(m1/4) 

v3(ij,*)  ^  0(m1/4) 

v4(  1,2,3,4)  ^OC/m174) 


.  ,4  are  constants   such  that 
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For  the  case  bx  =  Pim175,  b2  =  02™  .  *>3  =  03™  .  &4  =  04m3/5,  2>5  =  p5m4/i  satisfy- 
ing the  conditions  g.c.d.  (b^bj)  =  1  for  all  i  =h  j,  p,  <  1  constants,  ij  =  1,  ...  ,5  we 
get 

v2(i,j)  >  0(m1/5) 

v3(ij,k)  ^  0(m1/5) 

v4( /,;,*,/)  >  0(mV5) 


v5(  1,2,3,4,5)  ^0(mv5) 
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6.    Method  of  Successive  Powers 

Our  method  of  choosing  primes  may  sometimes  present  certain  inconveniences,  such 
as  computing  and  storing  our  necessary  lists  of  primes  and  retrieving  them  in  a  reproduci- 
ble way  in  parallel.  It  would  be  valuable,  therefore,  to  consider  a  selection  of  y^'s  that 
reduces  the  computation  of  vf  to  a  previously  solved  problem. 

The  method  now  suggested  for  producing  a  large  value  of  v,  where  m  =  2®  is  to 
choose 

=  jt-1 


7j  =  dl      mod  m      for  i  =  1,2,  .  .  .  ,t 


where  d  is  odd. 
Then 

v,  = 


min 

(Ji *, 

K', s,)*(.o 0) 


\Vsj  +...+  s}  \sx  +  ds2  +...+  d'-^s,  =  0 


mod  m 


and  in  particular 
v2(ij) 


m 
3  (y )*(0,0) 


in     [y/sf  +  sf 

,sj)      (  J 


Sid' 


i-l 


+  sjdj 


-l  _ 


0  mod  m 


(6.1) 


Assuming  w.l.o.g.  i<j  then 


v2(i,j) 


min 

3(V)#(0,0) 


Vs?  +  sf  I  Si  +  sjdj   '  =  0  mod  m 


We  are  looking  for 


v2  =    min  v2(i,j) 


Let  d=vm  +  \t  then  for  every  integer;? 

dp  =  p  vm  +  1  mod  m  (this  assumes  m  =  2p  with  £  even) 


or 
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v2(/J)  =      min     \Vs?  +  sj    st  +  sj  (j-i)V^  +  \] 

3(V;)#(0,0) 


=  0  mod  m 


Using  Theorem  1  we  get 


2/  •  -\  -^  m 

vi(i,j)  £= 


1  +  f(;-i)Vm  +  ll' 


Hence  for  all  pairs  (i,j)      v|(/ J)  5: 


m 


of  the  order  of 


VZ 


1  +   [(f-l)Vm  +  ll 


2        with  a  lower  bound  for  vr 


m 


t-\ 


1/3 


Consider  next  a  lower  bound  on  v3.    Since 

v3*  <  2V6min 

we  would  like  to  ensure  a  lower  bound  again  of  the  order  of  m 
Let  d  =  mm  +  1 

Then  for  every  integer  p 

d?  =   [P)m2/3  +  pmm  +  lmodm 

The  condition 

sl  +  s2dj~'  +  s3dk~'  =  0  mod  m 
(/</<;'</:<?)  is  replaced  by 

Sl  +  s2  +  s3  +    [s2U~i)  +  s3(k-i)]mv3  +   [s2^2l]  +  sAk2  ')] 


m 


2/3 


lm      (6.2) 


/  an  integer.    Hence  it  follows  that 


1/3 


s\  +  s2  +  s3  =  am 

m2/3 
If  a#0  then  s\  +  s2  +  s3  —  ~~ 5 —  by  the  Schwartz  inequality. 

If  a  =  0  this  implies 
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1/3 


s2U-0  +  s2(k-i)  =  |3m1/J. 


If  P^O  this  relation  (again  with  Schwartz  inequality)  implies 

.2/3 


s2  +  si  +  si  >  s\  +  s]  > 


m 


m 


2/3 


(j-i)2  +  (k-i)2      2(r-l): 
We  are  therefore  left  with  the  case  a  =  8  =  0,  but  (5  =  0  implies 


(k-i) 

52  =  ~S3lFo 


and  (6.2)  implies 


*3 


m  -  mv) 


m2/3  =  Im 


But  since  (si,s2,S2)  *  (0,0,0),  equation  (6.3)  implies 

,V3 


\S3\^ 


m' 


2m 


1/3 


(Jfc-0 


k-i-l        j-i- l 


(k-i)(lc-j) 


1*212= 


2m 


1/3 


(k-j)(j-i) 


or 


52  +  sj  +  J|  a: 


2/3 


(r-1)' 


hence 


and 


■3*  >  VmW/2(r-l)2  =  wW/(f-l)V2 


v2  a  v3 


(6.3) 


i.e.  by  choosing  d=mm  +  1  (to  be  an  integer  (3  =  3/)  we  can  guarantee  lower  bounds  of 
order  m173/*  for  both  v^and  V3.  Our  previous  method  using  primes  gave  lower  bounds 
independent  of  r. 
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Comparison 


Prime  Method  <i-Method 


Lower  bound  for  v2*  Vp2  +  p}-X~ O \m 1/2]  m */( t -  1 ) 

Lower  bound  for  (v^)  [o(my3) tO(my3)^  [p{mm/t),  0(m1/3/f)) 

Example:    m  =  248,     r  =  6134  processors^. 

Lower  bound  for    v\  1.4X107  2736 


<3>  There  are  6134  primes  between  9.9  X    106  and  10  X    106  (see  Hardy  &  Wright  [11]). 
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7.   Further  Results 

The  problem  we  have  discussed,  constructing  a  spectral  test  for  the  pairs  (xni,xn J)  to 
be  uniformly  distributed  and  choosing  the  additive  constants  bt  such  that  v|  will  be  suffi- 
ciently large,  was  motivated  by  our  special  interest  in  minimizing  the  variance.  Using  a 
similar  technique  we  can  construct  a  test  for  pairs  of  the  form  (xni,xn+kj+[).  Using  (2.2) 

n i 

xn,i  ~  *o,i  +  7,        j     mod  m 

n  +  k_  i 

x„+k,i+i  ~  *0,i+l  +  "ii+l     a_1        mod  m 

where 

yi  +  r  =  bi+r  +  (a-l)x0J+r  for  r  integer  . 

Therefore 

xn+k,i+l  ~  ir^t+fiP'nji  +    constant  . 

Using  now  the  method  that  led  to  (2.6)  we  get  the  two  dimensional  spectral  test  v$k\i,i+l) 
to  be 


v{k\i,i  +  l)  =      min 

3(^,^)^(0,0) 


vs\  +  s%  \  Si  +  7,-  1yi+laks2  —  0  mod  m 


and  as  in  (2.6)  can  be  computed  using  Algorithm  S. 

A  natural  generalization  is  to  consider  triplets  either  of  the  form  {xni,xnj,xnk) 

or  of  the  general  form  (xnti,xn  +  k:i+l,xn  +  k+S:i  +  l  +  r) 
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As  before 


an-\ 

(n,i  m  x0i  +  7i —   mod  m 

a  — I 


,n  +  k 


*n+k,i+i  =  •*o,l+/  +  'Yj+/        i     m°d  m 


(n  +  k  +  s,i+l  +  r  ™  ^Ci  +  Z  +  j    +   7;  +  /+, 


an  +  k  +  s_-[ 

a-\ 


mod  m 


*/»+*,,+/  ■  It  l1i+iakxn<i  +  c2 


xn+k+s,i+l+r   m  It     1i  +  l  +  ra       SX„;   +   C2 


c1  and  c2  are  constants,  and  the  spectral  test  v$f,s\i,i  +  l,i  +  l  +  r)  will  be  of  the  form 


vik'*\i,i+l,i+l+r)=       min 

3(i^j^3)  "(0,0,0) 


VrJ  +  jf  +  jf  \si  +  y(  1yi+iakS2  +  yi  lyi+l+rak+ss3=0  mod  m 
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8.   Conclusions 

It  is  gratifying  to  have  methods  that  produce  good  lower  bounds  on  v2  for  all  pairs  in 
a  large  ensemble  of  parallel  streams.  The  use  of  prime  bt  yields  a  bound  V2  =  0(m  ) 
which  is  satisfactory  for  many  calculations.  The  very  convenient  method  of  finding  b(  from 
7,  =  ^'_1  yields  good  bounds  0(m%/(t—  1))  for  small  numbers  of  streams,  say  up  to  100.  If 
the  number  of  streams  can  be  set  in  advance  (e.q.  one  for  each  of  a  fixed  number  of 
processes),  then  each  can  be  assigned  its  unique  prime  or  its  unique  7,-. 

For  the  case  discussed  in  the  Introduction  where  an  indefinite  number  of  streams  may 
be  invoked,  as  in  the  general  random  walk  with  branching,  where  the  computational  load 
may  be  shared  unpredictably,  and  where  reproducibility  of  results  is  required,  a  technical 
problem  remains:  how  are  primes  or  powers  of  d  assigned  to  processes?  We  do  not 
understand  how  to  do  that  and  ensure  that  the  same  b  is  not  used  twice.  But  if  one  accepts 
using  b's  twice  with  very  small  probability,  the  quality  of  the  calculation  remains  high. 
Several  schemes  are  possible.  When  one  initiates  a  new  stream,  one  can  "hash"  the  x  and 
b  from  the  parent  stream  to  an  integer  which  points  into  a  large  table  of  primes.  That  is, 
the  state  of  the  parent  is  used  to  give  any  entry  in  the  table  with  equal  chance.  If  one  used 
either  b  or  x  alone,  then  the  new  streams  would  have  a  simple  relation  to  the  old.  For 
example,  if  we  use  b  alone,  then  all  streams  whose  position  on  the  tree  of  streams  involves 
a  fixed  number  of  left  steps  will  have  the  same  b.  If  one  uses  x  alone,  then  every  stream 
started  with  a  particular  x  (admittedly  rare)  will  be  exactly  the  same  stream.  Picking  7 
(instead  of  b)  also  serves  to  mix  x  and  b. 

We  should  also  point  out  that  we  have  been  using,  with  seeming  success,  a  completely 
ad  hoc  scheme  in  this  style  for  generating  new  streams.    It  is 

bj  +  i  =  (abbj  +  cb  mod  m)©  x 

ab  and  bb  are  derived  from  a  different  widely  used  generator.  The  exclusive  or  (©)  opera- 
tion is  used  simply  to  mix  the  x  part  of  the  state  and  so  avoid  the  predictable  use  of  identi- 
cal b's  as  discussed  above.  We  have  tested  this  generator  in  several  ways.  The  most  not- 
able is  in  a  simple  soluble  branching  random  walk.  As  discussed  in  the  introduction,  des- 
cendent  walkers  were  assigned  new  random  streams.  Eight  moments  of  the  distribution  of 
walkers  were  calculated  correctly,  a  pragmatic  and  strigent  test  of  the  independence  of  dif- 
ferent streams. 
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Nevertheless,  we  regard  our  analytic  results  as  important,  if  only  as  a  beginning  for  a 
theory  of  parallel  generators.  We  hope,  in  the  future,  to  provide  some  bounds  on  correla- 
tions between  xtj  and  xi+mj+n  and  in  so  doing  to  put  our  proposals  on  a  still  more  satisfac- 
tory basis.  It  would  also  be  good  to  have  similar  theory  and  bounds  for  parallel  versions  of 
other  generators  such  as  Tausworthe,  additive  Fibonacci  and  composite  generators. 

Finally,  we  note  that  these  methods  can  also  be  used  to  improve  the  quality  of  purely 
sequential  generators.  It  is  well  known  (or  ought  to  be)  that  linear  congruential  generators 
that  use  powers  of  two  as  a  modulus  have  strong  sequential  correlations  between  any  entry 
and  a  successor  delayed  by  a  (moderately  large)  power  of  two.  In  our  testing,  we  have 
verified  that  such  a  correlation  is  serious  for  delays  of  1024  and  larger.  There  are  cases  of 
errors  in  Monte  Carlo  calculations  in  which  exactly  such  numbers  of  variables  are  used 
repeatedly  in  a  fixed  pattern.  A  simple  cure  is  to  change  the  additive  constant  using  the 
prescriptions  given  here. 
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